Paper: Auto-Encoding Variational Bayes


How to perform efficient approximate inference and learning with directed probabilistic models whose continuous latent variables and/or parameters have intractable posterior distributions?

Variational Bayesian (VB) methods involve the optimization of an approximation to the intractable posterior. The common mean-field approach requires analytical solutions of expectations w.r.t. the approximate posterior, which are also intractable in the general case.

This paper shows how a reparameterization of the variational lower bound yields a simple, differentiable, unbiased estimator of the lower bound.

This Stochastic Gradient Variational Bayes (SGVB) estimator can be used for efficient approximate posterior inference in almost any model with continuous latent variables and/or parameters, and it is straightforward to optimize using standard stochastic gradient ascent techniques.

For the case of an i.i.d. dataset with continuous latent variables per datapoint, the paper proposes the Auto-Encoding Variational Bayes (AEVB) algorithm. It uses the SGVB estimator to optimize a recognition model that allows us to perform efficient approximate posterior inference using simple ancestral sampling. This makes inference and learning especially efficient, which in turn allows us to efficiently learn the model parameters.

The learned posterior inference model can also be used for a host of tasks:

  • recognition
  • denoising
  • representation
  • visualization

When a neural network is used for the recognition model, we arrive at the variational auto-encoder.

2 Method

In this paper, we restrict ourselves to the common case where

  • We have an i.i.d. dataset with latent variables per datapoint,
  • We would like to perform maximum likelihood or maximum a posteriori inference on the global parameters, and variational inference on the latent variables.

The scenario can be extended, but the paper focuses on this common case.

2.1 Problem scenario

Review of representation learning from the DL textbook (beginning of Chapter 15):

A feedforward network trained by supervised learning can be seen as follows: the last layer is a linear classifier or nearest-neighbor classifier, and the rest of the network learns to provide a representation to this classifier. Training with a supervised criterion naturally leads to the representation at every hidden layer taking on properties that make the classification easier.

Just like supervised networks, unsupervised deep learning algorithms have a main training objective but also learn a representation as a side effect. Most representation learning problems face a trade-off between:

  • Preserve as much information about the input as possible
  • Attain nice properties such as independence

Consider a dataset X = {x(i)}, i = 1, …, N, consisting of N i.i.d. samples of some discrete or continuous variable x.

We assume that the data are generated by some random process involving an unobserved continuous random variable z. This data generating process consists of two steps:

  1. a value z(i) is generated from some prior distribution pθ*(z).
  2. a value x(i) is generated from some conditional distribution pθ*(x|z). We assume that

    1. the prior pθ*(z) and the likelihood pθ*(x|z) come from parametric families of distributions pθ(z) and pθ(x|z), and
    2. their probability density functions (PDFs) are differentiable with respect to both θ and z.

Unfortunately, a lot of this process is hidden from our view: the true parameters θ* as well as the values of the latent variable z(i) are unknown to us.

Very importantly, we do not make the common simplifying assumptions about

  1. the marginal pθ(x)
  2. the posterior pθ(z|x)

Such simplifying assumptions are commonly made to get around these difficulties. But no, we are not going to use that strategy. So without these simplifying assumptions, what kind of challenges are we facing?

  1. Intractability:

    1. the marginal likelihood pθ(x) = ∫pθ(z)pθ(x|z)dz is intractable
    2. the true posterior pθ(z|x) = pθ(x|z)pθ(z)/pθ(x) is intractable
    3. the required integrals for any reasonable mean-field VB algorithm are also intractable.
  2. A large dataset: we have so much data that batch optimization is too costly. We would like to make parameter updates using small minibatches or even single datapoints. Sampling-based solutions, e.g. Monte Carlo EM, would in general be too slow, since they involve a typically expensive sampling loop per datapoint.

A review of the intractable inference problem can be found in Chapter 19 of the DL textbook.


We are interested in, and propose solutions to, three related problems in the above scenario:

  1. Efficient approximate maximum likelihood (ML) or maximum a posteriori (MAP) estimation for the parameters θ.
  2. Efficient approximate posterior inference of the latent variable z given an observed value of x for a choice of θ. This is useful for coding and data representation tasks.
  3. Efficient approximate marginal inference of the variable x. This allows us to perform all kinds of tasks where a prior over x is required. Common applications in computer vision include image denoising, inpainting, and super-resolution.

For the purpose of solving the above problems, we introduce a recognition model qϕ(z|x): an approximation to the intractable true posterior pθ(z|x).

  • qϕ(z|x) is a probabilistic encoder: given a datapoint x, it produces a distribution (e.g. a Gaussian) over the possible values of the code z from which the datapoint x could have been generated.
  • pθ(x|z) is a probabilistic decoder: given a code z, it produces a distribution over the possible corresponding values of x.

2.2 The variational bound

$$\begin{aligned} KL[q_{\phi}(z|x) || p_{\theta}(z|x)] &= \sum_z q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p_{\theta}(z|x)} \\ &= \sum_z q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p_{\theta}(x, z) / p_{\theta}(x)} \\ &= \log p_{\theta}(x) + \sum_z q_{\phi}(z|x) \left(\log q_{\phi}(z|x) - \log p_{\theta}(x, z)\right) \\ &= \log p_{\theta}(x) + \sum_z q_{\phi}(z|x) \left(\log q_{\phi}(z|x) - \log p_{\theta}(x|z) - \log p_{\theta}(z)\right) \\ &= \log p_{\theta}(x) + KL[q_{\phi}(z|x) || p_{\theta}(z)] - \sum_z q_{\phi}(z|x) \log p_{\theta}(x|z) \end{aligned}$$

Now we have


$$\log p_{\theta}(x) = KL[q_{\phi}(z|x) || p_{\theta}(z|x)] - KL[q_{\phi}(z|x) || p_{\theta}(z)] + \sum_z q_{\phi}(z|x) \log p_{\theta}(x|z)$$

Because KL[qϕ(z|x) || pθ(z|x)] ≥ 0, we have

$$\begin{aligned} \log p_{\theta}(x) &\geq L(\theta, \phi; x) \\ &= - KL[q_{\phi}(z|x) || p_{\theta}(z)] + \sum_z q_{\phi}(z|x) \log p_{\theta}(x|z) \\ &= - KL[q_{\phi}(z|x) || p_{\theta}(z)] + E_{z \sim q_{\phi}(z|x)} \log p_{\theta}(x|z) \end{aligned}$$
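As a quick sanity check of this decomposition, here is a small numeric sketch (my own, not from the paper) with a toy discrete latent variable, where every quantity can be computed exactly; the distributions are made up for illustration.

```python
import numpy as np

# Toy discrete model: z takes 3 values, x is one fixed observed outcome.
p_z = np.array([0.5, 0.3, 0.2])          # prior p_theta(z)
p_x_given_z = np.array([0.9, 0.4, 0.1])  # likelihood p_theta(x | z) at the observed x
q_z = np.array([0.6, 0.3, 0.1])          # some approximate posterior q_phi(z | x)

p_xz = p_z * p_x_given_z                 # joint p_theta(x, z)
p_x = p_xz.sum()                         # marginal p_theta(x)
p_z_given_x = p_xz / p_x                 # true posterior p_theta(z | x)

kl_q_true_post = np.sum(q_z * np.log(q_z / p_z_given_x))  # KL[q || p(z|x)]
kl_q_prior = np.sum(q_z * np.log(q_z / p_z))              # KL[q || p(z)]
expected_loglik = np.sum(q_z * np.log(p_x_given_z))       # E_q[log p(x|z)]

elbo = -kl_q_prior + expected_loglik
print(np.log(p_x), kl_q_true_post + elbo)  # equal up to floating point error
```

The gap between log pθ(x) and the lower bound is exactly KL[qϕ(z|x) || pθ(z|x)], which is why tightening the bound amounts to making qϕ(z|x) match the true posterior.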

We want to differentiate and optimize the lower bound w.r.t. both the variational parameters ϕ and the generative parameters θ.

To estimate the gradient of an expectation w.r.t. ϕ using the naive Monte Carlo (score-function) approach, we have:

$$\begin{aligned} \nabla_{\phi} E_{z \sim q_{\phi}(z)} f(z) &= \nabla_{\phi} \sum_z q_{\phi}(z) f(z) \\ &= \sum_z f(z) \nabla_{\phi} q_{\phi}(z) \\ &= \sum_z f(z) q_{\phi}(z) \nabla_{\phi} \log q_{\phi}(z) \\ &= E_{z \sim q_{\phi}(z)} [f(z) \nabla_{\phi} \log q_{\phi}(z)] \\ &\approx \frac{1}{L}\sum_{l=1}^L f(z^{(l)}) \nabla_{\phi} \log q_{\phi}(z^{(l)}) \end{aligned}$$

where z(l) ∼ qϕ(z).

This gradient estimator exhibits very high variance.
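To see this concretely, here is a small numeric sketch (my own, not from the paper) of the score-function estimator for a 1-D Gaussian qϕ(z) = N(μ, 1) and f(z) = z², where the true gradient ∂/∂μ E[z²] = 2μ is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, L = 1.5, 100                        # variational parameter and samples per estimate

def score_function_grad_estimate():
    z = rng.normal(mu, 1.0, size=L)     # z^(l) ~ q_phi(z) = N(mu, 1)
    f = z ** 2                          # f(z) = z^2, so d/dmu E[f(z)] = 2 * mu
    score = z - mu                      # grad_mu log N(z; mu, 1)
    return np.mean(f * score)

estimates = np.array([score_function_grad_estimate() for _ in range(2000)])
print("true gradient:", 2 * mu)
print("mean of estimates:", estimates.mean(), " std:", estimates.std())
```

The estimator is unbiased but noticeably noisy even with L = 100 samples per estimate; the reparameterized version sketched below achieves a markedly smaller spread with the same number of samples.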

2.3 The SGVB estimator and AEVB algorithm

Under certain mild conditions outlined in Section 2.4, for a chosen approximate posterior qϕ(z|x) we can reparameterize the random variable z ∼ qϕ(z|x) using a differentiable transformation gϕ(ϵ, x) of an auxiliary noise variable ϵ:

$$z = g_{\phi}(\epsilon, x) \quad \text{with} \quad \epsilon \sim p(\epsilon)$$

We can now form Monte Carlo estimates of the expectation of some function f(z) w.r.t. qϕ(z|x):

$$\begin{aligned} E_{z \sim q_{\phi}(z|x)} f(z) &= E_{\epsilon \sim p(\epsilon)}[f(g_{\phi}(\epsilon, x))] \\ &\approx \frac{1}{L}\sum_{l=1}^L f(g_{\phi}(\epsilon^{(l)}, x)) \end{aligned}$$
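Continuing the toy example above (again my own sketch, not the paper's code): for qϕ(z) = N(μ, 1) the transformation is simply z = gϕ(ϵ) = μ + ϵ with ϵ ∼ N(0, 1), and the gradient of the Monte Carlo estimate of E[z²] becomes a plain derivative through the samples.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, L = 1.5, 100

def reparameterized_grad_estimate():
    eps = rng.normal(0.0, 1.0, size=L)  # eps^(l) ~ p(eps) = N(0, 1)
    z = mu + eps                        # z = g_phi(eps) for q_phi(z) = N(mu, 1)
    return np.mean(2 * z)               # d/dmu f(g_phi(eps)) = 2 * z for f(z) = z^2

estimates = np.array([reparameterized_grad_estimate() for _ in range(2000)])
print("true gradient:", 2 * mu)
print("mean of estimates:", estimates.mean(), " std:", estimates.std())
```

Both estimators are unbiased, but the standard deviation here is several times smaller than for the score-function estimator above, which is what makes the reparameterized bound practical to optimize with stochastic gradients.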

Now we apply this technique to the variational lower bound:

$$\begin{aligned} L(\theta, \phi; x^{(i)}) &= \sum_z q_{\phi}(z|x^{(i)}) \left(- \log q_{\phi}(z|x^{(i)}) + \log p_{\theta}(x^{(i)}, z)\right) \\ &\approx \frac{1}{L} \sum_{l=1}^L \left(- \log q_{\phi}(z^{(i, l)}|x^{(i)}) + \log p_{\theta}(x^{(i)}, z^{(i, l)})\right) \end{aligned}$$

where z(i, l) = gϕ(ϵ(i, l), x(i)) and ϵ ∼ p(ϵ)

Recall that:


$$L(\theta, \phi; x) = - KL[q_{\phi}(z|x) || p_{\theta}(z)] + E_{z \sim q_{\phi}(z|x)} \log p_{\theta}(x|z)$$

The KL divergence can be integrated analytically, such that only the expected reconstruction error E_{z ∼ qϕ(z|x)} log pθ(x|z) requires estimation by sampling.

The KL divergence can then be interpreted as regularizing ϕ, encouraging the approximate posterior qϕ(z|x) to be close to the prior pθ(z). This yields a second version of the SGVB estimator

$$\hat{L}^B(\theta, \phi; x^{(i)}) = - KL[q_{\phi}(z|x^{(i)}) || p_{\theta}(z)] + \frac{1}{L} \sum_{l=1}^{L} (\log p_{\theta}(x^{(i)} | z^{(i, l)}))$$

where z(i, l) = gϕ(ϵ(i, l), x(i)) and ϵ ∼ p(ϵ)

Given multiple datapoints from a dataset X with N datapoints, we can construct an estimator of the marginal likelihood lower bound of the full dataset, based on minibatches:

$$L(\theta, \phi; X) \approx \hat{L}^M(\theta, \phi; X^M) = \frac{N}{M} \sum_{i=1}^{M} \hat{L}(\theta, \phi; x^{(i)})$$

where the minibatch X^M = {x(i)}, i = 1, …, M, is a randomly drawn sample of M datapoints from the full dataset X with N datapoints.

The number of samples L per datapoint can be set to 1 as long as the minibatch size M is large enough.
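As a tiny sketch of this scaling (mine, not the paper's): once a lower-bound estimate L̂(θ, ϕ; x(i)) is available for each datapoint in a minibatch, the full-dataset estimate is just the minibatch sum rescaled by N/M.

```python
import numpy as np

def full_dataset_bound_estimate(per_datapoint_bounds, N):
    """Estimate L(theta, phi; X) from per-datapoint estimates L_hat(theta, phi; x^(i)).

    per_datapoint_bounds: length-M array, one estimate per minibatch datapoint
    (each typically computed with L = 1 noise samples); N: full dataset size.
    """
    M = len(per_datapoint_bounds)
    return (N / M) * np.sum(per_datapoint_bounds)

# Made-up per-datapoint estimates for a minibatch of M = 4, dataset of N = 1000.
print(full_dataset_bound_estimate(np.array([-102.3, -98.7, -110.1, -95.4]), N=1000))
```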


Now let's look at the equation again, this time connecting it with the auto-encoder:

$$\hat{L}^B(\theta, \phi; x^{(i)}) = - KL[q_{\phi}(z|x^{(i)}) || p_{\theta}(z)] + \frac{1}{L} \sum_{l=1}^{L} (\log p_{\theta}(x^{(i)} | z^{(i, l)}))$$

where z(i, l) = gϕ(ϵ(i, l), x(i)) and ϵ ∼ p(ϵ)

  •  − KL[qϕ(z|x(i))||pθ(z)] serves as a regularizer.
  • $\frac{1}{L} \sum_{l=1}^{L} (\log p_{\theta}(x^{(i)} | z^{(i, l)}))$ serves as an expected negative reconstruction error. gϕ(⋅) is chosen such that it maps a datapoint x(i) and a random noise vector ϵ(l) to a sample from the approximate posterior for that datapoint. The sample z(i, l) = gϕ(x(i), ϵ(l)) is then the input to the function log pθ(x(i)|z(i, l)), which equals the log probability density (or mass) of datapoint x(i) under the generative model, given z(i, l). This term is a negative reconstruction error in auto-encoder parlance.

2.4 The reparameterization trick

Alternative method for generating samples from qϕ(z|x):

  • Let z be a continuous random variable and z ∼ qϕ(z|x)
  • Express the random variable as a deterministic variable z = gϕ(ϵ, x)
  • ϵ is an auxiliary variable with independent marginal p(ϵ)
  • gϕ(⋅) is some vector-valued function parameterized by ϕ

It is useful because it can be used to rewrite an expectation w.r.t qϕ(z|x) such that the Monte Carlo estimate of the expectation is differentiable w.r.t ϕ.

How to choose gϕ(⋅) and the auxiliary variable ϵ:

  1. Tractable inverse CDF (see the sketch below)
  2. Analogous to the Gaussian example: location-scale families
  3. Composition
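As an illustration of case 1 (mine, not from the paper): an Exponential(λ) latent variable has a tractable inverse CDF, so it can be reparameterized as z = gλ(ϵ) = −log(1 − ϵ)/λ with ϵ ∼ Uniform(0, 1), making samples of z differentiable functions of λ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(lam, eps):
    """Inverse-CDF reparameterization of Exponential(lam): z = -log(1 - eps) / lam."""
    return -np.log(1.0 - eps) / lam

eps = rng.uniform(size=100_000)           # auxiliary noise, independent of lam
z = sample_exponential(lam=2.0, eps=eps)
print(z.mean())                           # close to 1 / lam = 0.5
```

Case 2 is the location-scale construction used for the Gaussian below (z = μ + σ ⊙ ϵ), and case 3 builds further distributions by composing such transformations (e.g. a log-normal as the exponential of a reparameterized Gaussian).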

3 Example: Variational Auto-Encoder

In the VAE, we use a neural network for the probabilistic encoder qϕ(z|x). Below are the game settings of the VAE:

  1. The prior pθ(z) = N(z; 0, I) is an isotropic multivariate Gaussian.
  2. Generator network / decoder pθ(x|z), whose distribution parameters are computed from z with an MLP:

    • if x is real-valued data: pθ(x|z) is a multivariate Gaussian
    • if x is binary data: pθ(x|z) is a Bernoulli
  3. Inference network / encoder qϕ(z|x(i)) = N(z; μ(i), σ2(i)I), where the mean μ(i) and standard deviation σ(i) are outputs of the encoding MLP.

Please note that there are many other ways to approximate the true posterior. The approximation is qϕ(z|x), and it is still Auto-Encoding Variational Bayes if we use other choices. The variational auto-encoder chooses to believe that qϕ(z|x(i)) = N(z; μ(i), σ2(i)I), and that the parameters μ(i), σ(i) can be reparameterized. But does it have to be a multivariate Gaussian? No, it does not. The VAE simply chooses to believe that! The same applies to the generator / decoder in the VAE.
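A minimal PyTorch sketch of these two networks for the binary-data (Bernoulli decoder) case; this is my own illustration, not the paper's code, and the layer sizes, names, and tanh nonlinearity are arbitrary choices.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Inference network q_phi(z|x): outputs mean and log-variance of a diagonal Gaussian."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Generator network p_theta(x|z): outputs Bernoulli probabilities for binary x."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.Tanh(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z)
```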

Remember that the objective function is:

$$\hat{L}^B(\theta, \phi; x^{(i)}) = - KL[q_{\phi}(z|x^{(i)}) || p_{\theta}(z)] + \frac{1}{L} \sum_{l=1}^{L} (\log p_{\theta}(x^{(i)} | z^{(i, l)}))$$

Because in this model we believe / choose pθ(z) and qϕ(z|x) to be Gaussians, the KL divergence can be computed and differentiated analytically, without estimation by sampling.

$$\begin{aligned} - KL[q_{\phi}(z|x^{(i)}) || p_{\theta}(z)] &= \int q_{\phi}(z|x^{(i)}) (\log p_{\theta}(z) - \log q_{\phi}(z|x^{(i)})) dz \\ &= \int q_{\phi}(z|x^{(i)}) \log p_{\theta}(z) dz - \int q_{\phi}(z|x^{(i)}) \log q_{\phi}(z|x^{(i)}) dz \end{aligned}$$

Now we break down the last expression:

$$\begin{aligned} \int q_{\phi}(z|x^{(i)}) \log p_{\theta}(z) dz &= \int N(z; \mu, \sigma^2) \log N(z; 0, I) dz \\ &= -\frac{J}{2} \log(2 \pi) - \frac{1}{2} \sum_{j=1}^J(\mu_j^2 + \sigma_j^2) \end{aligned}$$

$$\begin{aligned} \int q_{\phi}(z|x^{(i)}) \log q_{\phi}(z|x^{(i)}) dz &= \int N(z; \mu, \sigma^2) \log N(z; \mu, \sigma^2) dz \\ &= -\frac{J}{2} \log(2 \pi) - \frac{1}{2} \sum_{j=1}^J (1 + \log(\sigma_j^2)) \end{aligned}$$

Putting them together:

$$\begin{aligned} - KL[q_{\phi}(z|x^{(i)}) || p_{\theta}(z)] &= \int q_{\phi}(z|x^{(i)}) \log p_{\theta}(z) dz - \int q_{\phi}(z|x^{(i)}) \log q_{\phi}(z|x^{(i)}) dz \\ &= \frac{1}{2} \sum_{j=1}^J (1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2) \end{aligned}$$
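A quick numeric check of this closed form (my own sketch, not from the paper): compare the analytic expression with a Monte Carlo estimate of E_q[log pθ(z) − log qϕ(z|x)] for an arbitrary μ and σ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])       # arbitrary mean, J = 2 latent dimensions
sigma = np.array([0.8, 1.3])     # arbitrary standard deviations

# Analytic -KL[q || p] for q = N(mu, diag(sigma^2)) and p = N(0, I)
neg_kl_analytic = 0.5 * np.sum(1 + np.log(sigma**2) - mu**2 - sigma**2)

# Monte Carlo estimate of E_q[log p(z) - log q(z)]
z = mu + sigma * rng.normal(size=(200_000, 2))
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
log_q = -0.5 * np.sum(((z - mu) / sigma)**2 + np.log(2 * np.pi * sigma**2), axis=1)
print(neg_kl_analytic, np.mean(log_p - log_q))   # the two values agree closely
```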

Now we can rewrite the estimated lower bound as:

$$\begin{aligned} \hat{L}^B(\theta, \phi; x^{(i)}) &= - KL[q_{\phi}(z|x^{(i)}) || p_{\theta}(z)] + \frac{1}{L} \sum_{l=1}^{L} (\log p_{\theta}(x^{(i)} | z^{(i, l)})) \\ &= \frac{1}{2} \sum_{j=1}^J (1 + \log((\sigma_j^{(i)})^2) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2) + \frac{1}{L} \sum_{l=1}^{L} (\log p_{\theta}(x^{(i)} | z^{(i, l)})) \end{aligned}$$

where z(i, l) = μ(i) + σ(i) ⊙ ϵ(l) and ϵ(l) ∼ N(0, I).
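Putting the pieces together, a minimal PyTorch sketch (my own, not the paper's code) of this estimator with L = 1 sample per datapoint and a Bernoulli decoder such as the one sketched earlier in this section; the small constant added inside the logs is a numerical-stability choice of mine.

```python
import torch

def elbo_estimate(x, mu, logvar, decoder, eps_clip=1e-8):
    """SGVB estimate L_hat^B for one minibatch, with L = 1 noise sample per datapoint.

    x: binary inputs, shape (M, x_dim); mu, logvar: encoder outputs, shape (M, z_dim);
    decoder: maps z to Bernoulli probabilities over x (e.g. the Decoder sketched above).
    """
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps

    # -KL[q(z|x) || p(z)] in closed form for diagonal Gaussians (the regularizer)
    neg_kl = 0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

    # log p(x|z) under the Bernoulli decoder (the negative reconstruction error)
    x_prob = decoder(z)
    rec = torch.sum(
        x * torch.log(x_prob + eps_clip) + (1 - x) * torch.log(1 - x_prob + eps_clip),
        dim=1,
    )

    # Average over the minibatch; scale by N and maximize with stochastic gradient ascent.
    return torch.mean(neg_kl + rec)
```

In practice one would maximize N times this quantity (or minimize its negative) with a stochastic gradient optimizer, updating the generative parameters θ and the variational parameters ϕ jointly.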